import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
df = pd.read_table('../datasets/geyser.dat', sep=r"\s+")
df.describe().transpose()
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 272.0 | 136.500000 | 78.663842 | 1.0 | 68.75000 | 136.5 | 204.25000 | 272.0 |
| eruptions | 272.0 | 3.487783 | 1.141371 | 1.6 | 2.16275 | 4.0 | 4.45425 | 5.1 |
| waiting | 272.0 | 70.897059 | 13.594974 | 43.0 | 58.00000 | 76.0 | 82.00000 | 96.0 |
Eruption duration in minutes. Type: numeric.
df.eruptions.describe()
count    272.000000
mean       3.487783
std        1.141371
min        1.600000
25%        2.162750
50%        4.000000
75%        4.454250
max        5.100000
Name: eruptions, dtype: float64
df.eruptions.median()
4.0
sns.histplot(data=df,x='eruptions')
<AxesSubplot:xlabel='eruptions', ylabel='Count'>
sns.boxplot(data=df, x='eruptions')
<AxesSubplot:xlabel='eruptions'>
Waiting time between two consecutive eruptions, in minutes.
Source: https://www.rdocumentation.org/packages/mixAK/versions/5.3/topics/Faithful
df.waiting.describe()
count    272.000000
mean      70.897059
std       13.594974
min       43.000000
25%       58.000000
50%       76.000000
75%       82.000000
max       96.000000
Name: waiting, dtype: float64
df.waiting.median()
76.0
sns.histplot(data=df,x='waiting')
<AxesSubplot:xlabel='waiting', ylabel='Count'>
sns.boxplot(data=df, x='waiting')
<AxesSubplot:xlabel='waiting'>
sns.scatterplot(data=df, x='eruptions',y='waiting')
<AxesSubplot:xlabel='eruptions', ylabel='waiting'>
The scatter plot shows the points falling into two clearly separated groups.
sns.displot(data=df, x="eruptions", y="waiting", kind="kde")
<seaborn.axisgrid.FacetGrid at 0x7f4de2077a30>
from sklearn.cluster import KMeans
X = df[['eruptions', 'waiting']]
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
df['cluster_kmeans'] = kmeans.predict(X)
sns.scatterplot(data=df, x='eruptions',y='waiting', hue='cluster_kmeans')
<AxesSubplot:xlabel='eruptions', ylabel='waiting'>
First, we run a "plain" DBSCAN to compare its results with k-means.
from sklearn.cluster import DBSCAN
X = df[['eruptions', 'waiting']]
# fit_predict fits the model and returns the labels in one call, so a separate .fit(X) is not needed
dbscan = DBSCAN(eps=1.2, min_samples=10, metric='euclidean')
df['cluster_dbscan'] = dbscan.fit_predict(X)
sns.scatterplot(data=df, x='eruptions',y='waiting', hue='cluster_dbscan')
<AxesSubplot:xlabel='eruptions', ylabel='waiting'>
As a second option, we can use DBSCAN to remove the noise, which leaves a cleaner plot.
X = df[['eruptions', 'waiting']]
# fit_predict fits the model and returns the labels in one call; noise points get the label -1
dbscan = DBSCAN(eps=1.1, min_samples=9, metric='euclidean')
df['cluster_dbscan2'] = dbscan.fit_predict(X)
df2 = df.loc[df['cluster_dbscan2'] != -1]
sns.scatterplot(data=df2, x='eruptions',y='waiting', hue='cluster_dbscan2')
<AxesSubplot:xlabel='eruptions', ylabel='waiting'>
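The noise filter above relies on scikit-learn's convention that `DBSCAN` assigns the label `-1` to points that do not belong to any cluster. A minimal sketch of that convention follows. It uses small synthetic data rather than the geyser set, and the `eps` and `min_samples` values are illustrative assumptions, not tuned parameters.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense 5x5 grids of points plus one isolated point (synthetic, illustrative data)
grid = np.array([[i * 0.1, j * 0.1] for i in range(5) for j in range(5)])
blob_a = grid
blob_b = grid + 5.0
outlier = np.array([[2.5, 20.0]])
points = np.vstack([blob_a, blob_b, outlier])

# eps and min_samples are assumed values chosen to suit this toy data
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(points)

# Noise points carry the label -1; every other label is a cluster id
noise_mask = labels == -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

# Keep only the points that were assigned to a cluster
clean_points = points[~noise_mask]
print(n_clusters, int(noise_mask.sum()), clean_points.shape)
```

Counting the `-1` labels before dropping them is a quick way to check how much data a given `eps` and `min_samples` combination discards.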
from sklearn.cluster import OPTICS
X = df[['eruptions', 'waiting']]
# fit_predict fits the model and returns the labels in one call, so a separate .fit(X) is not needed
optics = OPTICS(max_eps=4, min_samples=40)
df['cluster_optics'] = optics.fit_predict(X)
sns.scatterplot(data=df, x='eruptions',y='waiting', hue='cluster_optics')
<AxesSubplot:xlabel='eruptions', ylabel='waiting'>
We plot the resulting clusters with the noise points removed.
df2 = df.loc[df['cluster_optics'] != -1]
sns.scatterplot(data=df2, x='eruptions',y='waiting', hue='cluster_optics')
<AxesSubplot:xlabel='eruptions', ylabel='waiting'>
A technical support team handles several different types of problems every day. Some problems are frequent and easy to solve, while others are complex and require several phone calls and technician visits before they are resolved. To make better use of the technicians' time, the company wants to form work teams, each responsible for a set of problems. The company wants to know how many teams to form and which problems to assign to each team, so that the problems handled by a team are similar to one another and the staff best suited to solving them can be selected. A database of all the problems handled is available, with statistical information about each one (issues.csv).
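One possible way to frame the "how many teams" question is to cluster the problems on their numeric statistics and compare candidate cluster counts with a silhouette score. The sketch below is only an illustration of that idea, not necessarily the approach this exercise takes. The rows are copied from the issues.csv table shown in this notebook, while the feature standardization, the range of `k`, and the use of `KMeans` with `silhouette_score` are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Numeric columns of issues.csv, one row per problem type:
# COUNT, AVG_CALLS_TO_RESOLVE, AVG_RESOLUTION_TIME, REOCCUR_RATE, REPLACEMENT_RATE
issues = np.array([
    [45, 2.3, 54, 0.15, 0.00],
    [47, 3.1, 132, 0.30, 0.03],
    [12, 4.0, 154, 0.02, 0.05],
    [165, 1.2, 32, 0.03, 0.00],
    [321, 1.0, 5, 0.21, 0.00],
    [22, 3.3, 140, 0.14, 0.01],
    [23, 4.3, 143, 0.21, 0.06],
    [230, 1.3, 23, 0.02, 0.00],
    [193, 1.2, 33, 0.03, 0.00],
    [24, 2.8, 180, 0.04, 0.00],
    [450, 1.0, 8, 0.25, 0.00],
    [520, 1.0, 7, 0.28, 0.00],
    [390, 1.0, 9, 0.27, 0.00],
    [140, 1.7, 23, 0.05, 0.04],
    [72, 2.3, 125, 0.02, 0.00],
    [290, 1.1, 11, 0.22, 0.00],
    [29, 2.2, 45, 0.35, 0.22],
    [43, 2.1, 56, 0.31, 0.28],
    [78, 2.2, 44, 0.19, 0.21],
    [170, 1.3, 32, 0.04, 0.00],
])

# Standardize so that COUNT (tens to hundreds) does not dominate the rates (0 to 1)
scaled = StandardScaler().fit_transform(issues)

# Silhouette score for each candidate number of teams (higher means better separated)
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scaled)
    scores[k] = silhouette_score(scaled, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("candidate number of teams:", best_k)
```

With only 20 problem types the scores can be close to one another, so the highest-scoring `k` is best read as a starting point to weigh against how many teams the company can realistically staff.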
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
df = pd.read_csv('../datasets/issues.csv')
df
| | PROBLEM_TYPE | COUNT | AVG_CALLS_TO_RESOLVE | AVG_RESOLUTION_TIME | REOCCUR_RATE | REPLACEMENT_RATE |
|---|---|---|---|---|---|---|
| 0 | Admin Password Lost | 45 | 2.3 | 54 | 0.15 | 0.00 |
| 1 | Windows Reboots automatically | 47 | 3.1 | 132 | 0.30 | 0.03 |
| 2 | System not coming up after reboot | 12 | 4.0 | 154 | 0.02 | 0.05 |
| 3 | Slow system | 165 | 1.2 | 32 | 0.03 | 0.00 |
| 4 | Internet Connectivity loss | 321 | 1.0 | 5 | 0.21 | 0.00 |
| 5 | New Installation hangs | 22 | 3.3 | 140 | 0.14 | 0.01 |
| 6 | Intermittent Blank Screen | 23 | 4.3 | 143 | 0.21 | 0.06 |
| 7 | Too many popups in Browser | 230 | 1.3 | 23 | 0.02 | 0.00 |
| 8 | Cannot find printer | 193 | 1.2 | 33 | 0.03 | 0.00 |
| 9 | Missing peripheral driver | 24 | 2.8 | 180 | 0.04 | 0.00 |
| 10 | Cannot detect keyboard | 450 | 1.0 | 8 | 0.25 | 0.00 |
| 11 | Cannot detect mouse | 520 | 1.0 | 7 | 0.28 | 0.00 |
| 12 | Head phone jack not working | 390 | 1.0 | 9 | 0.27 | 0.00 |
| 13 | DVD read error | 140 | 1.7 | 23 | 0.05 | 0.04 |
| 14 | Cannot recover using restore | 72 | 2.3 | 125 | 0.02 | 0.00 |
| 15 | WIFI not functioning | 290 | 1.1 | 11 | 0.22 | 0.00 |
| 16 | Laptop not charging | 29 | 2.2 | 45 | 0.35 | 0.22 |
| 17 | Laptop loses charge very fast | 43 | 2.1 | 56 | 0.31 | 0.28 |
| 18 | Dark areas on screen | 78 | 2.2 | 44 | 0.19 | 0.21 |
| 19 | anti-virus not working | 170 | 1.3 | 32 | 0.04 | 0.00 |
df.PROBLEM_TYPE
0     Admin Password Lost
1     Windows Reboots automatically
2     System not coming up after reboot
3     Slow system
4     Internet Connectivity loss
5     New Installation hangs
6     Intermittent Blank Screen
7     Too many popups in Browser
8     Cannot find printer
9     Missing peripheral driver
10    Cannot detect keyboard
11    Cannot detect mouse
12    Head phone jack not working
13    DVD read error
14    Cannot recover using restore
15    WIFI not functioning
16    Laptop not charging
17    Laptop loses charge very fast
18    Dark areas on screen
19    anti-virus not working
Name: PROBLEM_TYPE, dtype: object
df.PROBLEM_TYPE.describe()
count     20
unique    20
top       anti-virus not working
freq      1
Name: PROBLEM_TYPE, dtype: object
sns.pairplot(data=df)
<seaborn.axisgrid.PairGrid at 0x7f4dcb569670>
df.COUNT.describe()
count     20.000000
mean     163.200000
std      156.483596
min       12.000000
25%       39.500000
50%      109.000000
75%      245.000000
max      520.000000
Name: COUNT, dtype: float64
px.histogram(df, x="PROBLEM_TYPE", y="COUNT", hover_data=df.columns)